InforLorV4, Main, Exploration, bibRecord, 008F28

A New Method Based on Context for Combining Statistical Language Models

Identifieur interne : 008F28 ( Main/Exploration ); précédent : 008F27; suivant : 008F29

A New Method Based on Context for Combining Statistical Language Models

Auteurs : David Langlois ; Kamel Smaïli ; Jean-Paul Haton [France]

Source :

RBID : CRIN:langlois01a

English descriptors

KwdEn :
- combination, distant models, statistical language modeling.

Abstract

In this paper we propose a new method to extract from a corpus the histories for which a given language model is better than another one. The decision is based on a measure stemmed from perplexity. This measure allows, for a given history, to compare two language models, and then to choose the best one for this history. Using this principle, and with a 20K vocabulary words, we combined two language models : a bigram and a distant bigram. The contribution of a distant bigram is significant and outperforms a bigram model by 7.5%. Moreover, the performance in Shannon game are improved. We show through this article that we proposed a cheaper framework in comparison to the maximum entropy principle, for combining language models. In addition, the selected histories for which a model is better than another one, have been collected and studied. Almost, all of them are beginnings of very frequently used French phrases. Finally, by using this principle, we achieve a better trigram model in terms of parameters and perplexity. This model is a combination of a bigram and a trigram based on a selected history.

Affiliations:

Links toward previous steps (curation, corpus...)

to stream Crin, to step Corpus: 002F06
to stream Crin, to step Curation: 002F06
to stream Crin, to step Checkpoint: 001544
to stream Main, to step Merge: 009449
to stream Main, to step Curation: 008F28

Le document en format XML

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" wicri:score="438">A New Method Based on Context for Combining Statistical Language Models</title>
</titleStmt>
<publicationStmt><idno type="RBID">CRIN:langlois01a</idno>
<date when="2001" year="2001">2001</date>
<idno type="wicri:Area/Crin/Corpus">002F06</idno>
<idno type="wicri:Area/Crin/Curation">002F06</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Curation">002F06</idno>
<idno type="wicri:Area/Crin/Checkpoint">001544</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Checkpoint">001544</idno>
<idno type="wicri:Area/Main/Merge">009449</idno>
<idno type="wicri:Area/Main/Curation">008F28</idno>
<idno type="wicri:Area/Main/Exploration">008F28</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">A New Method Based on Context for Combining Statistical Language Models</title>
<author><name sortKey="Langlois, David" sort="Langlois, David" uniqKey="Langlois D" first="David" last="Langlois">David Langlois</name>
</author>
<author><name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaïli">Kamel Smaïli</name>
</author>
<author><name sortKey="Haton, Jean Paul" sort="Haton, Jean Paul" uniqKey="Haton J" first="Jean-Paul" last="Haton">Jean-Paul Haton</name>
<affiliation><country>France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>combination</term>
<term>distant models</term>
<term>statistical language modeling</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en" wicri:score="3746">In this paper we propose a new method to extract from a corpus the histories for which a given language model is better than another one. The decision is based on a measure stemmed from perplexity. This measure allows, for a given history, to compare two language models, and then to choose the best one for this history. Using this principle, and with a 20K vocabulary words, we combined two language models : a bigram and a distant bigram. The contribution of a distant bigram is significant and outperforms a bigram model by 7.5%. Moreover, the performance in Shannon game are improved. We show through this article that we proposed a cheaper framework in comparison to the maximum entropy principle, for combining language models. In addition, the selected histories for which a model is better than another one, have been collected and studied. Almost, all of them are beginnings of very frequently used French phrases. Finally, by using this principle, we achieve a better trigram model in terms of parameters and perplexity. This model is a combination of a bigram and a trigram based on a selected history.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Grand Est</li>
<li>Lorraine (région)</li>
</region>
<settlement><li>Nancy</li>
</settlement>
<orgName><li>Centre national de la recherche scientifique</li>
<li>Institut national de recherche en informatique et en automatique</li>
<li>Laboratoire lorrain de recherche en informatique et ses applications</li>
<li>Université de Lorraine</li>
</orgName>
</list>
<tree><noCountry><name sortKey="Langlois, David" sort="Langlois, David" uniqKey="Langlois D" first="David" last="Langlois">David Langlois</name>
<name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaïli">Kamel Smaïli</name>
</noCountry>
<country name="France"><region name="Grand Est"><name sortKey="Haton, Jean Paul" sort="Haton, Jean Paul" uniqKey="Haton J" first="Jean-Paul" last="Haton">Jean-Paul Haton</name>
</region>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 008F28 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 008F28 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     CRIN:langlois01a
   |texte=   A New Method Based on Context for Combining Statistical Language Models
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022

	Serveur d'exploration sur la recherche en informatique en Lorraine
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration sur la recherche en informatique en Lorraine

A New Method Based on Context for Combining Statistical Language Models

A New Method Based on Context for Combining Statistical Language Models

Source :

English descriptors

Abstract

Links toward previous steps (curation, corpus...)

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri